Abstract

We provide a unifying framework linking two 
classes of statistics used in two-sample and 
independence testing: on the one hand, the 
energy distances and distance covariances 
from the statistics literature; on the other, 
distances between embeddings of distribu- 
tions to reproducing kernel Hilbert spaces 
(RKHS), as established in machine learning. 
The equivalence holds when energy distances 
are computed with semimetrics of negative 
type, in which case a kernel may be defined 
such that the RKHS distance between dis- 
tributions corresponds exactly to the energy 
distance. We determine the class of proba- 
bility distributions for which kernels induced 
by semimetrics are characteristic (that is, for 
which embeddings of the distributions to an 
RKHS are injective). Finally, we investigate 
the performance of this family of kernels in 
two-sample and independence tests: we show 
in particular that the energy distance most 
commonly employed in statistics is just one 
member of a parametric family of kernels, 
and that other choices from this family can 
yield more powerful tests. 

1. Introduction 

The problem of testing statistical hypotheses in high 
dimensional spaces is particularly challenging, and has 
been a recent focus of considerable work in the statis- 
tics and machine learning communities. On the sta- 
tistical side, two-sample testing in Euclidean spaces 
(of whether two independent samples are from the 
same distribution, or from different distributions) can
be accomplished using a so-called energy distance as 
a statistic (Szekely & Rizzo, 2004; 2005). Such tests 
are consistent against all alternatives as long as the 
random variables have finite first moments. A re- 
lated dependence measure between vectors of high 
dimension is the distance covariance (Szekely et al., 
2007; Szekely & Rizzo, 2009), and the resulting test is 
again consistent for variables with bounded first mo- 
ment. The distance covariance has had a major im- 
pact in the statistics community, with Szekely & Rizzo 
(2009) being accompanied by an editorial introduc- 
tion and discussion. A particular advantage of energy 
distance-based statistics is their compact representa- 
tion in terms of certain expectations of pairwise Eu- 
clidean distances, which leads to straightforward em- 
pirical estimates. As a follow-up work, Lyons (2011) 
generalized the notion of distance covariance to metric 
spaces of negative type (of which Euclidean spaces are 
a special case). 

On the machine learning side, two-sample tests have 
been formulated based on embeddings of probability 
distributions into reproducing kernel Hilbert spaces 
(Gretton et al., 2012), using as the test statistic the 
difference between these embeddings: this statistic 
is called the maximum mean discrepancy (MMD). 
This distance measure was applied to the prob- 
lem of testing for independence, with the associ- 
ated test statistic being the Hilbert-Schmidt Indepen- 
dence Criterion (HSIC) (Gretton et al., 2005a; 2008; 
Smola et al., 2007; Zhang et al., 2011). Both tests 
are shown to be consistent against all alternatives 
when a characteristic RKHS is used (Fukumizu et al., 
2008; Sriperumbudur et al., 2010). Such tests can fur- 
ther be generalized to structured and non-Euclidean 
domains, such as text strings, graphs or groups 
(Fukumizu et al., 2009). 

Despite their striking similarity, the link between 
energy distance-based tests and kernel-based tests 
has been an open question. In the discussion
of Szekely & Rizzo (2009), Gretton et al. (2009b, 
p. 1289) first explored this link in the context of 
independence testing, and stated that interpreting 
the distance-based independence statistic as a kernel 
statistic is not straightforward, since Bochner's theo- 
rem does not apply to the choice of weight function 
used in the definition of Brownian distance covariance
(we briefly review this argument in Section A.3
of the Appendix). Szekely & Rizzo (2009, Rejoinder, 
p. 1303) confirmed this conclusion, and commented 
that RKHS-based dependence measures do not seem to 
be formal extensions of Brownian distance covariance 
because the weight function is not integrable. Our con- 
tribution resolves this question and shows that RKHS- 
based dependence measures are precisely the formal 
extensions of Brownian distance covariance, where the 
problem of non-integrability of weight functions is circumvented
by using translation-variant kernels, i.e.,
distance-induced kernels, a novel family of kernels that 
we introduce in Section 2.2. 

In the case of two-sample testing, we demonstrate that 
energy distances are in fact maximum mean discrepan- 
cies arising from the same family of distance-induced 
kernels. A number of interesting consequences arise 
from this insight: first, we show that the energy dis- 
tance (and distance covariance) derives from a partic- 
ular parameter choice from a larger family of kernels: 
this choice may not yield the most sensitive test. Sec- 
ond, results from Gretton et al. (2009a); Zhang et al. 
(2011) may be applied to get consistent two-sample 
and independence tests for the energy distance, with- 
out using bootstrap, which perform much better than 
the upper bound proposed by Szekely et al. (2007) as 
an alternative to the bootstrap. Third, in relation to 
Lyons (2011), we obtain a new family of characteristic 
kernels arising from semimetric spaces of negative type 
(where the triangle inequality need not hold), which 
are quite unlike the characteristic kernels defined via 
Bochner's theorem (Sriperumbudur et al., 2010). 

The structure of the paper is as follows: In Section 
2, we provide the necessary definitions from RKHS 
theory, and the relation between RKHS and semimet- 
rics of negative type. In Section 3.1, we review both 
the energy distance and distance covariance. We re- 
late these quantities in Sections 3.2 and 3.3 to the 
Maximum Mean Discrepancy (MMD) and the Hilbert- 
Schmidt Independence Criterion (HSIC), respectively. 
We give conditions for these quantities to distinguish 
between probability measures in Section 4, thus ob- 
taining a new family of characteristic kernels. Empir- 
ical estimates of these quantities and associated two- 
sample and independence tests are described in Section 5.
Finally, in Section 6, we investigate the performance
of the test statistics on a variety of testing
problems, which demonstrate the strengths of the new 
kernel family. 

2. Definitions and Notation 

In this section, we introduce concepts and notation 
required to understand reproducing kernel Hilbert 
spaces (Section 2.1), and distribution embeddings into 
RKHS. We then introduce semimetrics (Section 2.2), 
and review the relation of semimetrics of negative type 
to RKHS kernels. 

2.1. RKHS Definitions 

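Unless stated otherwise, we will assume that $\mathcal{Z}$ is any
topological space.

Definition 1. (RKHS) Let $\mathcal{H}$ be a Hilbert space
of real-valued functions defined on $\mathcal{Z}$. A function
$k : \mathcal{Z} \times \mathcal{Z} \to \mathbb{R}$ is called a reproducing kernel of $\mathcal{H}$
if (i) $\forall z \in \mathcal{Z}$, $k(\cdot, z) \in \mathcal{H}$, and (ii) $\forall z \in \mathcal{Z}$, $\forall f \in \mathcal{H}$,
$\langle f, k(\cdot, z) \rangle_{\mathcal{H}} = f(z)$. If $\mathcal{H}$ has a reproducing kernel,
it is called a reproducing kernel Hilbert space (RKHS).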

According to the Moore-Aronszajn theorem
(Berlinet & Thomas-Agnan, 2004, p. 19), for every
symmetric, positive definite function $k : \mathcal{Z} \times \mathcal{Z} \to \mathbb{R}$,
there is an associated RKHS $\mathcal{H}_k$ of real-valued
functions on $\mathcal{Z}$ with reproducing kernel $k$. The map
$\varphi : \mathcal{Z} \to \mathcal{H}_k$, $\varphi : z \mapsto k(\cdot, z)$ is called the canonical
feature map or the Aronszajn map of $k$. We will say
that $k$ is a nondegenerate kernel if its Aronszajn map
is injective.

2.2. Semimetrics of Negative Type 

We will work with the notion of a semimetric of negative
type on a non-empty set $\mathcal{Z}$, where the "distance"
function need not satisfy the triangle inequality. Note
that this notion of semimetric differs from the one that
arises from a seminorm (also called a pseudonorm), where the
distance between two distinct points can be zero.

Definition 2. (Semimetric) Let $\mathcal{Z}$ be a non-empty
set and let $\rho : \mathcal{Z} \times \mathcal{Z} \to [0, \infty)$ be a function such
that $\forall z, z' \in \mathcal{Z}$, (i) $\rho(z, z') = 0$ if and only if $z = z'$,
and (ii) $\rho(z, z') = \rho(z', z)$. Then $(\mathcal{Z}, \rho)$ is said to be a
semimetric space and $\rho$ is called a semimetric on $\mathcal{Z}$. If,
in addition, (iii) $\forall z, z', z'' \in \mathcal{Z}$, $\rho(z', z'') \le \rho(z, z') + \rho(z, z'')$,
$(\mathcal{Z}, \rho)$ is said to be a metric space and $\rho$ is
called a metric on $\mathcal{Z}$.

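Definition 3. (Negative type) The semimetric
space $(\mathcal{Z}, \rho)$ is said to have negative type if $\forall n \ge 2$,
$z_1, \ldots, z_n \in \mathcal{Z}$, and $\alpha_1, \ldots, \alpha_n \in \mathbb{R}$ with $\sum_{i=1}^n \alpha_i = 0$,

$$\sum_{i=1}^n \sum_{j=1}^n \alpha_i \alpha_j \rho(z_i, z_j) \le 0. \tag{1}$$

Hypothesis Testing Using Pairwise Distances and Associated Kernels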

Note that in the terminology of Berg et al. (1984), $\rho$
satisfying (1) is said to be a negative definite function.
The following proposition is a direct consequence of
Berg et al. (1984, Proposition 3.2, p. 82).

Proposition 4. $\rho$ is a semimetric of negative type
if and only if there exists a Hilbert space $\mathcal{H}$ and an
injective map $\varphi : \mathcal{Z} \to \mathcal{H}$, such that

$$\rho(z, z') = \|\varphi(z) - \varphi(z')\|^2_{\mathcal{H}}. \tag{2}$$

This shows that $(\mathbb{R}^d, \|\cdot - \cdot\|^2)$ is of negative type.
From Berg et al. (1984, Corollary 2.10, p. 78), we have
that:

Proposition 5. If $\rho$ satisfies (1), then so does $\rho^q$, for
$0 < q < 1$.

Therefore, by taking q = 1/2, we conclude that all 
Euclidean spaces are of negative type. While Lyons 
(2011, p. 9) also uses the result in Proposition 4, he 
studies embeddings to general Hilbert spaces, and the 
relation with the theory of reproducing kernel Hilbert 
spaces is not exploited. Semimetrics of negative type 
and symmetric positive definite kernels are in fact 
closely related, as summarized in the following Lemma 
based on Berg et al. (1984, Lemma 2.1, p. 74). 

Lemma 6. Let $\mathcal{Z}$ be a nonempty set, and let $\rho$ be a
semimetric on $\mathcal{Z}$. Let $z_0 \in \mathcal{Z}$, and denote
$k(z, z') = \rho(z, z_0) + \rho(z', z_0) - \rho(z, z')$. Then $k$ is positive definite
if and only if $\rho$ satisfies (1).

We call the kernel $k$ defined above the distance-induced
kernel, and say that it is induced by the semimetric
$\rho$. For brevity, we will drop "induced" hereafter, and
say that $k$ is simply the distance kernel (with some
abuse of terminology). In addition, we will typically
work with distance kernels scaled by 1/2. Note that
$k(z_0, z_0) = 0$, so distance kernels are not strictly
positive definite (equivalently, $k(\cdot, z_0) = 0$). By varying
"the point at the center" $z_0$, one obtains a family
$\mathcal{K}_\rho = \{ \frac{1}{2} [\rho(z, z_0) + \rho(z', z_0) - \rho(z, z')] \}_{z_0 \in \mathcal{Z}}$ of distance
kernels induced by $\rho$. We may now express (2)
from Proposition 4 in terms of the canonical feature
map for the RKHS $\mathcal{H}_k$ (proof in Appendix A.1).
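The construction in Lemma 6 can be sketched numerically. The following is a minimal illustration, assuming the Euclidean distance on $\mathbb{R}^2$ as the semimetric and the 1/2 scaling; the helper names `distance_kernel` and `quadratic_form` are hypothetical and not part of the original text. It checks on random coefficient vectors that the induced kernel gives non-negative quadratic forms, and that the kernel vanishes at the centre point.

```python
import math
import random


def distance_kernel(rho, z0):
    """Return k(z, z') = 0.5 * [rho(z, z0) + rho(z', z0) - rho(z, z')]."""
    def k(z, zp):
        return 0.5 * (rho(z, z0) + rho(zp, z0) - rho(z, zp))
    return k


def quadratic_form(k, points, coeffs):
    """Compute sum_{i,j} c_i c_j k(z_i, z_j)."""
    return sum(ci * cj * k(zi, zj)
               for ci, zi in zip(coeffs, points)
               for cj, zj in zip(coeffs, points))


rng = random.Random(0)
points = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(15)]

# Assumption: Euclidean distance (math.dist) as the semimetric, centred at the origin.
k = distance_kernel(math.dist, z0=(0.0, 0.0))

# Positive definiteness: the quadratic form should be non-negative for any coefficients.
forms = [quadratic_form(k, points, [rng.gauss(0, 1) for _ in points])
         for _ in range(200)]
min_form = min(forms)

# The kernel vanishes at the centre point, so it is not strictly positive definite.
k_at_centre = k((0.0, 0.0), (0.0, 0.0))
```

A random search of this kind can only fail to find a counterexample; it does not prove positive definiteness, which is what Lemma 6 establishes.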

Proposition 7. Let $(\mathcal{Z}, \rho)$ be a semimetric space of
negative type, and $k \in \mathcal{K}_\rho$. Then:

1. $k$ is nondegenerate, i.e., the Aronszajn map $z \mapsto k(\cdot, z)$ is injective.

2. $\rho(z, z') = k(z, z) + k(z', z') - 2k(z, z') = \|k(\cdot, z) - k(\cdot, z')\|^2_{\mathcal{H}_k}$.

Note that Proposition 7 implies that the Aronszajn
map $z \mapsto k(\cdot, z)$ is an isometric embedding of a metric
space $(\mathcal{Z}, \rho^{1/2})$ into $\mathcal{H}_k$, for every $k \in \mathcal{K}_\rho$.
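The first equality in part 2 of Proposition 7 can be checked by direct substitution. The short derivation below is a sketch that assumes the 1/2-scaled distance kernel $k(z, z') = \frac{1}{2}[\rho(z, z_0) + \rho(z', z_0) - \rho(z, z')]$ and uses only $\rho(z, z) = 0$; the second equality is the usual expansion of the squared RKHS norm via the reproducing property.

```latex
\begin{align*}
k(z, z) &= \tfrac{1}{2}\left[\rho(z, z_0) + \rho(z, z_0) - \rho(z, z)\right] = \rho(z, z_0), \\
k(z, z) + k(z', z') - 2k(z, z')
  &= \rho(z, z_0) + \rho(z', z_0) - \left[\rho(z, z_0) + \rho(z', z_0) - \rho(z, z')\right] \\
  &= \rho(z, z').
\end{align*}
```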



2.3. Kernels Inducing Semimetrics 

We now further develop the link between semimetrics
of negative type and kernels. Let $k$ be any nondegenerate
reproducing kernel on $\mathcal{Z}$ (for example, every
strictly positive definite $k$ is nondegenerate). Then, by
Proposition 4,

$$\rho(z, z') = k(z, z) + k(z', z') - 2k(z, z') \tag{3}$$

defines a valid semimetric $\rho$ of negative type on $\mathcal{Z}$.
We will say that $k$ generates $\rho$. It is clear that every
distance kernel $\tilde{k} \in \mathcal{K}_\rho$ also generates $\rho$, and that $\tilde{k}$
can be expressed as:

$$\tilde{k}(z, z') = k(z, z') + k(z_0, z_0) - k(z, z_0) - k(z', z_0), \tag{4}$$

for some $z_0 \in \mathcal{Z}$. In addition, $k \in \mathcal{K}_\rho$ if and only if
$k(z_0, z_0) = 0$ for some $z_0 \in \mathcal{Z}$. Hence, it is clear that
any strictly positive definite kernel, e.g., the Gaussian
kernel $e^{-\sigma \|z - z'\|^2}$, is not a distance kernel.
Example 8. Let $\mathcal{Z} = \mathbb{R}^d$ and write $\rho_q(z, z') = \|z - z'\|^q$.
By combining Propositions 4 and 5, $\rho_q$ is a
valid semimetric of negative type for $0 < q \le 2$. It is
a metric of negative type if $q \le 1$. The corresponding
distance kernel "centered at zero" is given by

$$k_q(z, z') = \frac{1}{2} \left( \|z\|^q + \|z'\|^q - \|z - z'\|^q \right). \tag{5}$$
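As an illustration of Example 8, the sketch below evaluates the kernel in (5) and probes the negative-type inequality (1) for a few exponents. It assumes points in $\mathbb{R}^3$ drawn at random; the function names `rho_q`, `k_q` and `negative_type_form` are illustrative choices, not notation from the original.

```python
import math
import random


def rho_q(z, zp, q):
    """Semimetric rho_q(z, z') = ||z - z'||^q on R^d."""
    return math.dist(z, zp) ** q


def k_q(z, zp, q):
    """Distance kernel centred at zero, following (5)."""
    origin = (0.0,) * len(z)
    return 0.5 * (math.dist(z, origin) ** q + math.dist(zp, origin) ** q
                  - math.dist(z, zp) ** q)


def negative_type_form(points, alphas, q):
    """Left-hand side of (1): sum_{i,j} alpha_i alpha_j rho_q(z_i, z_j)."""
    return sum(ai * aj * rho_q(zi, zj, q)
               for ai, zi in zip(alphas, points)
               for aj, zj in zip(alphas, points))


rng = random.Random(1)
points = [tuple(rng.gauss(0, 1) for _ in range(3)) for _ in range(12)]

worst = {}
for q in (0.5, 1.0, 1.5, 2.0):
    values = []
    for _ in range(100):
        raw = [rng.gauss(0, 1) for _ in points]
        mean = sum(raw) / len(raw)
        alphas = [a - mean for a in raw]  # enforce sum(alphas) == 0
        values.append(negative_type_form(points, alphas, q))
    worst[q] = max(values)  # expected to be <= 0 up to rounding
```

For $q = 2$ the kernel in (5) reduces to the ordinary inner product $\langle z, z' \rangle$, which gives a quick sanity check on the implementation.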



Example 9. Let $\mathcal{Z} = \mathbb{R}^d$, and consider the Gaussian
kernel $k(z, z') = e^{-\sigma \|z - z'\|^2}$. The induced semimetric
is $\rho(z, z') = 2 \left[ 1 - e^{-\sigma \|z - z'\|^2} \right]$. There are many other
kernels that generate $\rho$, however; for example, the distance
kernel induced by $\rho$ and "centered at zero" is

$$\tilde{k}(z, z') = e^{-\sigma \|z - z'\|^2} + 1 - e^{-\sigma \|z\|^2} - e^{-\sigma \|z'\|^2}.$$
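Example 9 can be verified numerically. The sketch below assumes $\sigma = 1$ and two arbitrary points in $\mathbb{R}^2$; the helper names are hypothetical. It builds the semimetric generated by the Gaussian kernel via (3), forms the distance kernel centred at the origin, and confirms that both kernels generate the same semimetric even though they differ as functions.

```python
import math


def gaussian_kernel(z, zp, sigma=1.0):
    """Gaussian kernel exp(-sigma * ||z - z'||^2); sigma = 1 is an arbitrary choice."""
    return math.exp(-sigma * math.dist(z, zp) ** 2)


def generated_semimetric(k):
    """Semimetric generated by a kernel, as in (3)."""
    def rho(z, zp):
        return k(z, z) + k(zp, zp) - 2.0 * k(z, zp)
    return rho


def distance_kernel(rho, z0):
    """Distance kernel (scaled by 1/2) induced by rho and centred at z0."""
    def k_tilde(z, zp):
        return 0.5 * (rho(z, z0) + rho(zp, z0) - rho(z, zp))
    return k_tilde


origin = (0.0, 0.0)
rho = generated_semimetric(gaussian_kernel)
k_tilde = distance_kernel(rho, origin)
rho_from_k_tilde = generated_semimetric(k_tilde)

z, zp = (0.3, -1.2), (1.1, 0.4)
# Closed form stated in Example 9 for the kernel centred at zero.
closed_form = (gaussian_kernel(z, zp) + 1.0
               - gaussian_kernel(z, origin) - gaussian_kernel(zp, origin))
```

The Gaussian kernel equals 1 on the diagonal while the distance kernel equals 0 at the centre point, which is the distinction the text draws between strictly positive definite kernels and distance kernels.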

3. Distances and Covariances 

In this section, we begin with a description of the en- 
ergy distance, which measures distance between dis- 
tributions; and distance covariance, which measures 
dependence. We then demonstrate that the former is 
a special instance of the maximum mean discrepancy 
(a kernel measure of distance on distributions), and 
the latter an instance of the Hilbert-Schmidt Independence
Criterion (a kernel dependence measure). We
will denote by $\mathcal{M}(\mathcal{Z})$ the set of all finite signed Borel
measures on $\mathcal{Z}$, and by $\mathcal{M}_+^1(\mathcal{Z})$ the set of all Borel
probability measures on $\mathcal{Z}$.
3.1. Energy Distance and Distance Covariance

Szekely & Rizzo (2004; 2005) use the following measure
of statistical distance between two probability
measures $P$ and $Q$ on $\mathbb{R}^d$, termed the energy distance:

$$D_E(P, Q) = 2\mathbb{E}_{ZW} \|Z - W\| - \mathbb{E}_{ZZ'} \|Z - Z'\| - \mathbb{E}_{WW'} \|W - W'\|, \tag{6}$$

where $Z, Z' \overset{i.i.d.}{\sim} P$ and $W, W' \overset{i.i.d.}{\sim} Q$. This quantity
characterizes the equality of distributions, and in the
scalar case, it coincides with twice the Cramer-Von
Mises distance. We may generalize it to a semimetric
space of negative type $(\mathcal{Z}, \rho)$, with the expression for
this generalized energy distance $D_{E,\rho}(P, Q)$ being
of the same form as (6), with the Euclidean distance
replaced by $\rho$. Note that the negative type of $\rho$ implies
the non-negativity of $D_{E,\rho}$. In Section 3.2, we
will show that for every $\rho$, $D_{E,\rho}$ is precisely the MMD
associated to a particular kernel $k$ on $\mathcal{Z}$.
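A plug-in estimate of (6) replaces each expectation by an average over sample pairs. The sketch below is one way to write this, assuming small synthetic Gaussian samples in $\mathbb{R}^2$; the function names and the sample sizes are illustrative. Passing a different `rho` gives the generalized version for another semimetric.

```python
import math
import random


def mean_pairwise(rho, a, b):
    """Average of rho(x, y) over all pairs (x, y) with x in a and y in b."""
    return sum(rho(x, y) for x in a for y in b) / (len(a) * len(b))


def energy_distance(z, w, rho=math.dist):
    """Plug-in estimate of 2 E rho(Z, W) - E rho(Z, Z') - E rho(W, W')."""
    return (2.0 * mean_pairwise(rho, z, w)
            - mean_pairwise(rho, z, z)
            - mean_pairwise(rho, w, w))


rng = random.Random(2)
z = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(40)]
w_same = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(40)]
w_shift = [(rng.gauss(2, 1), rng.gauss(0, 1)) for _ in range(40)]

d_same = energy_distance(z, w_same)    # samples from the same distribution
d_shift = energy_distance(z, w_shift)  # mean shifted in one coordinate
d_self = energy_distance(z, z)         # a sample against itself
```

Because the within-sample averages include the zero diagonal terms, this is the biased (V-statistic) form rather than an unbiased estimate.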

Now, let $X$ be a random vector on $\mathbb{R}^p$ and $Y$ a random
vector on $\mathbb{R}^q$. The distance covariance was introduced
in Szekely et al. (2007); Szekely & Rizzo (2009)
to address the problem of testing and measuring dependence
between $X$ and $Y$, in terms of a weighted $L_2$-distance
between characteristic functions of the joint
distribution of $X$ and $Y$ and the product of their
marginals. Given a particular choice of weight function,
it can be computed in terms of certain expectations
of pairwise Euclidean distances,

$$\begin{aligned}
\mathcal{V}^2(X, Y) = {} & \mathbb{E}_{XY} \mathbb{E}_{X'Y'} \|X - X'\| \|Y - Y'\| \\
& + \mathbb{E}_X \mathbb{E}_{X'} \|X - X'\| \, \mathbb{E}_Y \mathbb{E}_{Y'} \|Y - Y'\| \\
& - 2 \mathbb{E}_{X'Y'} \left[ \mathbb{E}_X \|X - X'\| \, \mathbb{E}_Y \|Y - Y'\| \right],
\end{aligned} \tag{7}$$

where $(X, Y)$ and $(X', Y')$ are $\overset{i.i.d.}{\sim} P_{XY}$. Recently,
Lyons (2011) established that the distance covariance
can be generalized to metric spaces of negative
type, with the expression for this generalized distance
covariance $\mathcal{V}^2_{\rho_{\mathcal{X}}, \rho_{\mathcal{Y}}}(X, Y)$ being of the same form
as (7), with Euclidean distances replaced by metrics of
negative type $\rho_{\mathcal{X}}$ and $\rho_{\mathcal{Y}}$ on domains $\mathcal{X}$ and $\mathcal{Y}$, respectively.
In Section 3.3, we will show that the generalized
distance covariance of a pair of random variables
$X$ and $Y$ is precisely HSIC associated to a particular
kernel $k$ on the product of domains of $X$ and $Y$.

3.2. Maximum Mean Discrepancy 

The notion of the feature map in an RKHS (Sec- 
tion 2.1) can be extended to kernel embeddings 
of probability measures (Berlinet & Thomas-Agnan, 
2004; Sriperumbudur et al., 2010). 
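Definition 10. (Kernel embedding) Let $k$ be a kernel
on $\mathcal{Z}$, and $P \in \mathcal{M}_+^1(\mathcal{Z})$. The kernel embedding
of $P$ into the RKHS $\mathcal{H}_k$ is $\mu_k(P) \in \mathcal{H}_k$ such that
$\mathbb{E}_{Z \sim P} f(Z) = \langle f, \mu_k(P) \rangle_{\mathcal{H}_k}$ for all $f \in \mathcal{H}_k$.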

Alternatively, the kernel embedding can be defined by
the Bochner expectation $\mu_k(P) = \mathbb{E}_{Z \sim P} k(\cdot, Z)$. By
the Riesz representation theorem, a sufficient condition
for the existence of $\mu_k(P)$ is that $k$ is Borel-measurable
and that $\mathbb{E}_{Z \sim P} k^{1/2}(Z, Z) < \infty$. If $k$ is
a bounded continuous function, this is obviously true
for all $P \in \mathcal{M}_+^1(\mathcal{Z})$. Kernel embeddings can be used to
induce metrics on the spaces of probability measures,
giving the maximum mean discrepancy (MMD),

$$\begin{aligned}
\gamma_k^2(P, Q) &= \|\mu_k(P) - \mu_k(Q)\|^2_{\mathcal{H}_k} \\
&= \mathbb{E}_{ZZ'} k(Z, Z') + \mathbb{E}_{WW'} k(W, W') - 2 \mathbb{E}_{ZW} k(Z, W),
\end{aligned} \tag{8}$$

where $Z, Z' \overset{i.i.d.}{\sim} P$ and $W, W' \overset{i.i.d.}{\sim} Q$. If the restriction
of $\mu_k$ to some $\mathcal{P}(\mathcal{Z}) \subseteq \mathcal{M}_+^1(\mathcal{Z})$ is well defined
and injective, then $k$ is said to be characteristic
to $\mathcal{P}(\mathcal{Z})$, and it is said to be characteristic (without
further qualification) if it is characteristic to $\mathcal{M}_+^1(\mathcal{Z})$.
When $k$ is characteristic, $\gamma_k$ is a metric on $\mathcal{M}_+^1(\mathcal{Z})$, i.e.,
$\gamma_k(P, Q) = 0$ iff $P = Q$, $\forall P, Q \in \mathcal{M}_+^1(\mathcal{Z})$. Conditions
under which kernels are characteristic have been studied
by Sriperumbudur et al. (2008); Fukumizu et al.
(2009); Sriperumbudur et al. (2010). An alternative
interpretation of (8) is as an integral probability metric
(Müller, 1997): see Gretton et al. (2012) for details.
In general, distance kernels are continuous but unbounded
functions. Thus, kernel embeddings are not
defined for all Borel probability measures, and one
needs to restrict attention to a class of Borel probability
measures for which $\mathbb{E}_{Z \sim P} k^{1/2}(Z, Z) < \infty$ when
discussing the maximum mean discrepancy. We will
assume that all Borel probability measures considered
satisfy the stronger condition that $\mathbb{E}_{Z \sim P} k(Z, Z) < \infty$
(this reflects a finite first moment condition on random
variables considered in distance covariance tests, and
will imply that all quantities appearing in our results
are well defined). For more details, see Section A.4
in the Appendix. As an alternative to requiring this
condition, one may assume that the underlying semimetric
space $(\mathcal{Z}, \rho)$ of negative type is itself bounded,
i.e., that $\sup_{z, z' \in \mathcal{Z}} \rho(z, z') < \infty$.

We are now able to describe the relation between the
maximum mean discrepancy and the energy distance.
The following theorem is a consequence of Lemma 6,
and is proved in Section A.1 of the Appendix.

Theorem 11. Let $(\mathcal{Z}, \rho)$ be a semimetric space of negative
type and let $z_0 \in \mathcal{Z}$. The distance kernel $k$ induced
by $\rho$ satisfies $\gamma_k^2(P, Q) = \frac{1}{2} D_{E,\rho}(P, Q)$. In particular,
$\gamma_k$ does not depend on the choice of $z_0$.
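The identity in Theorem 11 can be traced through by substituting the distance kernel into (8). The lines below are a sketch of that calculation, assuming the 1/2-scaled kernel $k(z, z') = \frac{1}{2}[\rho(z, z_0) + \rho(z', z_0) - \rho(z, z')]$ and finite first moments so that every expectation exists; the full proof is the one given in the Appendix.

```latex
\begin{align*}
\gamma_k^2(P, Q)
  &= \mathbb{E}_{ZZ'} k(Z, Z') + \mathbb{E}_{WW'} k(W, W') - 2\mathbb{E}_{ZW} k(Z, W) \\
  &= \tfrac{1}{2}\left[ 2\mathbb{E}_Z \rho(Z, z_0) - \mathbb{E}_{ZZ'} \rho(Z, Z') \right]
   + \tfrac{1}{2}\left[ 2\mathbb{E}_W \rho(W, z_0) - \mathbb{E}_{WW'} \rho(W, W') \right] \\
  &\quad - \left[ \mathbb{E}_Z \rho(Z, z_0) + \mathbb{E}_W \rho(W, z_0) - \mathbb{E}_{ZW} \rho(Z, W) \right] \\
  &= \mathbb{E}_{ZW} \rho(Z, W) - \tfrac{1}{2}\mathbb{E}_{ZZ'} \rho(Z, Z') - \tfrac{1}{2}\mathbb{E}_{WW'} \rho(W, W')
   = \tfrac{1}{2} D_{E,\rho}(P, Q).
\end{align*}
```

All terms involving $z_0$ cancel, which is why the result does not depend on the centre point.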
There is a subtlety to the link between kernels and
semimetrics, when used in computing the distance on
probabilities. Consider again the family of distance
kernels $\mathcal{K}_\rho$, where the semimetric $\rho$ is itself generated
from $k$ according to (3). As we have seen, it may
be that $k \notin \mathcal{K}_\rho$; however, it is clear that
$\gamma_k^2(P, Q) = \frac{1}{2} D_{E,\rho}(P, Q)$ whenever $k$ generates $\rho$. Thus, all kernels
that generate the same semimetric $\rho$ on $\mathcal{Z}$ give
rise to the same metric $\gamma_k$ on (possibly a subset of)
$\mathcal{M}_+^1(\mathcal{Z})$, and $\gamma_k$ is merely an extension of the metric
$\rho^{1/2}$ on the point masses. The kernel-based and
distance-based methods are therefore equivalent, provided
that we allow "distances" $\rho$ which may not satisfy
the triangle inequality.

3.3. The Hilbert-Schmidt Independence Criterion

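Given a pair of jointly observed random variables
$(X, Y)$ with values in $\mathcal{X} \times \mathcal{Y}$, the Hilbert-Schmidt Independence
Criterion (HSIC) is computed as the maximum
mean discrepancy between the joint distribution
$P_{XY}$ and the product of its marginals $P_X P_Y$. Let $k_{\mathcal{X}}$
and $k_{\mathcal{Y}}$ be kernels on $\mathcal{X}$ and $\mathcal{Y}$, with respective RKHSs
$\mathcal{H}_{k_{\mathcal{X}}}$ and $\mathcal{H}_{k_{\mathcal{Y}}}$. Following Smola et al. (2007, Section
2.3), we consider the MMD associated to the kernel
$k((x, y), (x', y')) = k_{\mathcal{X}}(x, x') k_{\mathcal{Y}}(y, y')$ on $\mathcal{X} \times \mathcal{Y}$ with
RKHS $\mathcal{H}_k$ isometrically isomorphic to the tensor product
$\mathcal{H}_{k_{\mathcal{X}}} \otimes \mathcal{H}_{k_{\mathcal{Y}}}$. It follows that $\theta := \gamma_k^2(P_{XY}, P_X P_Y)$
with

$$\begin{aligned}
\theta &= \left\| \mathbb{E}_{XY} \left[ k_{\mathcal{X}}(\cdot, X) \otimes k_{\mathcal{Y}}(\cdot, Y) \right] - \mathbb{E}_X k_{\mathcal{X}}(\cdot, X) \otimes \mathbb{E}_Y k_{\mathcal{Y}}(\cdot, Y) \right\|^2_{\mathcal{H}_{k_{\mathcal{X}}} \otimes \mathcal{H}_{k_{\mathcal{Y}}}} \\
&= \mathbb{E}_{XY} \mathbb{E}_{X'Y'} k_{\mathcal{X}}(X, X') k_{\mathcal{Y}}(Y, Y') \\
&\quad + \mathbb{E}_X \mathbb{E}_{X'} k_{\mathcal{X}}(X, X') \, \mathbb{E}_Y \mathbb{E}_{Y'} k_{\mathcal{Y}}(Y, Y') \\
&\quad - 2 \mathbb{E}_{X'Y'} \left[ \mathbb{E}_X k_{\mathcal{X}}(X, X') \, \mathbb{E}_Y k_{\mathcal{Y}}(Y, Y') \right],
\end{aligned}$$

where in the last step we used that
$\langle f \otimes g, f' \otimes g' \rangle_{\mathcal{H}_{k_{\mathcal{X}}} \otimes \mathcal{H}_{k_{\mathcal{Y}}}} = \langle f, f' \rangle_{\mathcal{H}_{k_{\mathcal{X}}}} \langle g, g' \rangle_{\mathcal{H}_{k_{\mathcal{Y}}}}$.
It can be shown that this quantity is the squared
Hilbert-Schmidt norm of the covariance operator between
RKHSs (Gretton et al., 2005b). The following
theorem demonstrates the link between HSIC and the
distance covariance, and is proved in Appendix A.1.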

Theorem 12. Let $(\mathcal{X}, \rho_{\mathcal{X}})$ and $(\mathcal{Y}, \rho_{\mathcal{Y}})$ be semimetric
spaces of negative type, and $(x_0, y_0) \in \mathcal{X} \times \mathcal{Y}$. Define

$$\begin{aligned}
k((x, y), (x', y')) := {} & \left[ \rho_{\mathcal{X}}(x, x_0) + \rho_{\mathcal{X}}(x', x_0) - \rho_{\mathcal{X}}(x, x') \right] \\
& \times \left[ \rho_{\mathcal{Y}}(y, y_0) + \rho_{\mathcal{Y}}(y', y_0) - \rho_{\mathcal{Y}}(y, y') \right].
\end{aligned} \tag{9}$$

Then, $k$ is a positive definite kernel on $\mathcal{X} \times \mathcal{Y}$, and
$\gamma_k^2(P_{XY}, P_X P_Y) = \mathcal{V}^2_{\rho_{\mathcal{X}}, \rho_{\mathcal{Y}}}(X, Y)$.

We remark that a similar result to Theorem 12 is given
by Lyons (2011, Proposition 3.16), but without making
use of the RKHS equivalence. Theorem 12 is a
more general statement, in the sense that we allow $\rho$
to be a semimetric of negative type, rather than a metric.
In addition to yielding a more general statement,
the RKHS equivalence leads to a significantly simpler
proof: the result is an immediate application of the
HSIC expansion of Smola et al. (2007).

4. Distinguishing Probability Distributions

Lyons (2011, Theorem 3.20) shows that distance covariance
in a metric space characterizes independence
if the metrics satisfy an additional property, termed
strong negative type. We will extend this notion to
a semimetric $\rho$. We will say that $P \in \mathcal{M}_+^1(\mathcal{Z})$ has
a finite first moment w.r.t. $\rho$ if $\int \rho(z, z_0) \, dP$ is finite
for some $z_0 \in \mathcal{Z}$. It is easy to see that the integral
$\int \rho \, d([P - Q] \times [P - Q]) = -D_{E,\rho}(P, Q)$ converges
whenever $P$ and $Q$ have finite first moments w.r.t.
$\rho$. In Appendix A.4, we show that this condition is
equivalent to $\mathbb{E}_{Z \sim P} k(Z, Z) < \infty$, for a kernel $k$ that
generates $\rho$, which implies the kernel embedding $\mu_k(P)$
is also well defined.

Definition 13. The semimetric space $(\mathcal{Z}, \rho)$ is said
to have a strong negative type if $\forall P, Q \in \mathcal{M}_+^1(\mathcal{Z})$ with
finite first moment w.r.t. $\rho$,

$$P \neq Q \implies \int \rho \, d([P - Q] \times [P - Q]) < 0. \tag{10}$$



The quantity in (10) is exactly $-2\gamma_k^2(P, Q)$ for all $P, Q$
with finite first moment w.r.t. $\rho$. We directly obtain:

Proposition 14. Let kernel $k$ generate $\rho$. Then $(\mathcal{Z}, \rho)$
has a strong negative type if and only if $k$ is characteristic
to all probability measures with finite first moment
w.r.t. $\rho$.

Thus, the problems of checking whether a semimetric 
is of strong negative type and whether its associated 
kernel is characteristic to an appropriate space of Borel 
probability measures are equivalent. This conclusion 
has some overlap with Lyons (2011): in particular, 
Proposition 14 is stated in Lyons (2011, Proposition 
3.10), where the barycenter map $\beta$ is a kernel embedding
in our terminology, although Lyons does not consider
distribution embeddings in an RKHS.

5. Empirical Estimates and Hypothesis Tests

In the case of two-sample testing, we are given i.i.d.
samples $\mathbf{z} = \{z_i\}_{i=1}^m \sim P$ and $\mathbf{w} = \{w_i\}_{i=1}^n \sim Q$. The
empirical (biased) V-statistic estimate of (8) is

$$\hat{\gamma}_k^2(\mathbf{z}, \mathbf{w}) = \frac{1}{m^2} \sum_{i=1}^m \sum_{j=1}^m k(z_i, z_j) + \frac{1}{n^2} \sum_{i=1}^n \sum_{j=1}^n k(w_i, w_j) - \frac{2}{mn} \sum_{i=1}^m \sum_{j=1}^n k(z_i, w_j). \tag{11}$$

Recall that if we use a distance kernel $k$ induced by a
semimetric $\rho$, this estimate involves only the pairwise
$\rho$-distances between the sample points.
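The claim that the estimate depends only on pairwise distances can be checked directly. The sketch below assumes one-dimensional synthetic samples and the Euclidean distance; `mmd_v_statistic`, `distance_kernel` and `energy_v_statistic` are illustrative names. It evaluates (11) with distance kernels centred at two different points and compares both with half the plug-in energy distance.

```python
import math
import random


def mmd_v_statistic(k, z, w):
    """Biased V-statistic estimate of the squared MMD, as in (11)."""
    m, n = len(z), len(w)
    zz = sum(k(a, b) for a in z for b in z) / (m * m)
    ww = sum(k(a, b) for a in w for b in w) / (n * n)
    zw = sum(k(a, b) for a in z for b in w) / (m * n)
    return zz + ww - 2.0 * zw


def distance_kernel(rho, z0):
    """Distance kernel (scaled by 1/2) induced by rho and centred at z0."""
    return lambda a, b: 0.5 * (rho(a, z0) + rho(b, z0) - rho(a, b))


def energy_v_statistic(rho, z, w):
    """Plug-in energy distance computed from pairwise rho-distances only."""
    m, n = len(z), len(w)
    zz = sum(rho(a, b) for a in z for b in z) / (m * m)
    ww = sum(rho(a, b) for a in w for b in w) / (n * n)
    zw = sum(rho(a, b) for a in z for b in w) / (m * n)
    return 2.0 * zw - zz - ww


rng = random.Random(3)
z = [(rng.gauss(0, 1),) for _ in range(30)]
w = [(rng.gauss(0.5, 1),) for _ in range(25)]

# Two arbitrary centre points; the statistic should not depend on this choice.
mmd_origin = mmd_v_statistic(distance_kernel(math.dist, (0.0,)), z, w)
mmd_shifted = mmd_v_statistic(distance_kernel(math.dist, (7.0,)), z, w)
half_energy = 0.5 * energy_v_statistic(math.dist, z, w)
```

The agreement is exact up to floating-point rounding, since Theorem 11 applies to the empirical distributions of the two samples.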

In the case of independence testing, we are given i.i.d.
samples $\mathbf{z} = \{(x_i, y_i)\}_{i=1}^m \sim P_{XY}$, and the resulting
V-statistic estimate (HSIC) is (Gretton et al., 2005a;
2008)

$$\mathrm{HSIC}(\mathbf{z}; k_{\mathcal{X}}, k_{\mathcal{Y}}) = \frac{1}{m^2} \mathrm{Tr}(K_{\mathcal{X}} H K_{\mathcal{Y}} H), \tag{12}$$

where $K_{\mathcal{X}}$, $K_{\mathcal{Y}}$ and $H$ are $m \times m$ matrices given
by $(K_{\mathcal{X}})_{ij} := k_{\mathcal{X}}(x_i, x_j)$, $(K_{\mathcal{Y}})_{ij} := k_{\mathcal{Y}}(y_i, y_j)$ and
$H_{ij} = \delta_{ij} - \frac{1}{m}$ (centering matrix). As in the two-sample
case, if both $k_{\mathcal{X}}$ and $k_{\mathcal{Y}}$ are distance kernels,
the test statistic involves only the pairwise distances
between the samples, i.e., kernel matrices in (12) may
be replaced by distance matrices.
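The replacement of kernel matrices by distance matrices in (12) can be illustrated as follows. This is a sketch under stated assumptions: the kernels are the unscaled ones from (9), the centre point is taken to be the first sample point, the data are synthetic, and the helper names are hypothetical. Centering removes the terms that depend on the centre point, and the two sign changes cancel in the product.

```python
import math
import random


def double_centre(mat):
    """Compute H M H for a square matrix M, with H = I - (1/m) 1 1^T."""
    m = len(mat)
    row = [sum(r) / m for r in mat]
    col = [sum(mat[i][j] for i in range(m)) / m for j in range(m)]
    total = sum(row) / m
    return [[mat[i][j] - row[i] - col[j] + total for j in range(m)]
            for i in range(m)]


def hsic_v_statistic(kx_mat, ky_mat):
    """(1/m^2) Tr(K_X H K_Y H), using Tr(K_X H K_Y H) = sum_ij (H K_X H)_ij (K_Y)_ij."""
    m = len(kx_mat)
    kx_c = double_centre(kx_mat)
    return sum(kx_c[i][j] * ky_mat[i][j]
               for i in range(m) for j in range(m)) / (m * m)


def unscaled_distance_kernel_matrix(dist, centre_index=0):
    """Gram matrix of rho(a, z0) + rho(b, z0) - rho(a, b), with z0 a sample point."""
    m, c = len(dist), centre_index
    return [[dist[i][c] + dist[j][c] - dist[i][j] for j in range(m)]
            for i in range(m)]


rng = random.Random(4)
x = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(20)]
y = [(a[0] ** 2 + 0.1 * rng.gauss(0, 1),) for a in x]  # dependent on x by construction

dx = [[math.dist(a, b) for b in x] for a in x]
dy = [[math.dist(a, b) for b in y] for a in y]

hsic_kernels = hsic_v_statistic(unscaled_distance_kernel_matrix(dx),
                                unscaled_distance_kernel_matrix(dy))
hsic_distances = hsic_v_statistic(dx, dy)  # distance matrices used directly
```

With the 1/2-scaled distance kernels on both domains, the kernel-based value would be one quarter of the distance-based one, so the scaling convention matters when comparing the two.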

We would like to design distance-based tests with
an asymptotic Type I error of $\alpha$, and thus we require
an estimate of the $(1 - \alpha)$-quantile of the V-statistic
distribution under the null hypothesis. Under
the null hypothesis, both (11) and (12) converge
to a particular weighted sum of chi-squared
distributed independent random variables (for more
details, see Section A.2). We investigate two approaches,
both of which yield consistent tests: a
bootstrap approach (Arcones & Gine, 1992), and a
spectral approach (Gretton et al., 2009a; Zhang et al.,
2011). The latter requires empirical computation of
the spectrum of kernel integral operators, a problem
studied extensively in the context of kernel PCA
(Scholkopf et al., 1997). In the two-sample case, one
computes the eigenvalues of the centred Gram matrix
$\tilde{K} = HKH$ on the aggregated samples. Here,
$K$ is a $2m \times 2m$ matrix, with entries $K_{ij} = k(u_i, u_j)$,
$\mathbf{u} = [\mathbf{z} \ \mathbf{w}]$ is the concatenation of the two samples
and $H$ is the centering matrix. Gretton et al. (2009a)
show that the null distribution defined using these finite
sample estimates converges to the population distribution,
provided that the spectrum is square-root
summable. The same approach can be used for a
consistent finite sample null distribution of HSIC, via
computation of the eigenvalues of $\tilde{K}_{\mathcal{X}} = H K_{\mathcal{X}} H$ and
$\tilde{K}_{\mathcal{Y}} = H K_{\mathcal{Y}} H$ (Zhang et al., 2011).
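To make the idea of estimating a null quantile concrete, the sketch below uses permutation resampling of the aggregated sample. This is a deliberately simple stand-in: it is neither the bootstrap of Arcones & Gine (1992) nor the spectral method described above, and the sample sizes, number of permutations, and function names are assumptions chosen for illustration.

```python
import math
import random


def mmd_from_distances(dist, idx_z, idx_w):
    """Biased squared MMD for the 1/2-scaled distance kernel, from pairwise distances."""
    def mean(rows, cols):
        return sum(dist[i][j] for i in rows for j in cols) / (len(rows) * len(cols))
    return mean(idx_z, idx_w) - 0.5 * mean(idx_z, idx_z) - 0.5 * mean(idx_w, idx_w)


def permutation_threshold(dist, m, alpha=0.05, n_perm=200, seed=0):
    """Estimate the (1 - alpha)-quantile of the statistic by relabelling the pooled sample."""
    rng = random.Random(seed)
    indices = list(range(len(dist)))
    null_stats = []
    for _ in range(n_perm):
        rng.shuffle(indices)
        null_stats.append(mmd_from_distances(dist, indices[:m], indices[m:]))
    null_stats.sort()
    position = min(n_perm - 1, math.ceil((1.0 - alpha) * n_perm) - 1)
    return null_stats[position]


rng = random.Random(5)
z = [(rng.gauss(0, 1),) for _ in range(20)]
w = [(rng.gauss(3, 1),) for _ in range(20)]  # clearly different mean
u = z + w  # aggregated sample
dist = [[math.dist(a, b) for b in u] for a in u]

observed = mmd_from_distances(dist, list(range(20)), list(range(20, 40)))
threshold = permutation_threshold(dist, m=20)
reject = observed > threshold
```

The pairwise distance matrix is computed once and reused across permutations, which is the main practical convenience of working with distance kernels here.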

Both Szekely & Rizzo (2004, p. 14) and Szekely et al.
(2007, pp. 2782-2783) establish that the energy distance
and distance covariance statistics, respectively, converge
to a particular weighted sum of chi-squares of
form similar to that found for the kernel-based statistics.
Analogous results for the generalized distance covariance
are presented by Lyons (2011, pp. 7-8). These
works do not propose test designs that attempt to estimate
the coefficients in such representations of the
null distribution, however (note also that these coefficients
have a more intuitive interpretation using
kernels). Besides the bootstrap, Szekely et al. (2007,
Theorem 6) also propose an independence test using
a bound applicable to a general quadratic form $Q$ of
centered Gaussian random variables with $\mathbb{E}[Q] = 1$:
$P\{Q \ge (\Phi^{-1}(1 - \alpha/2))^2\} \le \alpha$, valid for $0 < \alpha \le 0.215$.
When applied to the distance covariance statistic,
the upper bound of $\alpha$ is achieved if $X$ and $Y$ are
independent Bernoulli variables. The authors remark
that the resulting criterion might be over-conservative.
Thus, more sensitive tests are possible by computing
the spectrum of the centred Gram matrices associated
to distance kernels, and we pursue this approach in the
next section.
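For a sense of scale, the critical value $(\Phi^{-1}(1 - \alpha/2))^2$ in the quadratic form bound is easy to evaluate. The snippet below is a small sketch using the standard normal quantile function from the Python standard library; the function name `quadratic_form_threshold` is illustrative, and how the statistic is normalized so that $\mathbb{E}[Q] = 1$ is left to the cited reference.

```python
from statistics import NormalDist


def quadratic_form_threshold(alpha):
    """Critical value (Phi^{-1}(1 - alpha/2))^2, stated for 0 < alpha <= 0.215."""
    if not 0.0 < alpha <= 0.215:
        raise ValueError("the bound is stated only for 0 < alpha <= 0.215")
    return NormalDist().inv_cdf(1.0 - alpha / 2.0) ** 2


threshold_5_percent = quadratic_form_threshold(0.05)  # approximately 3.84
```

This is the same number as the 0.95 quantile of a chi-squared distribution with one degree of freedom, which is consistent with the bound being attained in the Bernoulli case mentioned above.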

6. Experiments 

6.1. Two-sample Experiments 

In the two-sample experiments, we investigate three 
different kinds of synthetic data. In the first, we com- 
pare two multivariate Gaussians, where the means dif- 
fer in one dimension only, and all variances are equal. 
In the second, we again compare two multivariate 
Gaussians, but this time with identical means in all 
dimensions, and variance that differs in a single dimen- 
sion. In our third experiment, we use the benchmark 
data of Sriperumbudur et al. (2009): one distribution 
is a univariate Gaussian, and the second is a univari- 
ate Gaussian with a sinusoidal perturbation of increas- 
ing frequency (where higher frequencies correspond to 
harder problems). All tests use a distance kernel in- 
duced by the Euclidean distance. As shown on the left 
plots in Figure 1, the spectral and bootstrap test de- 
signs appear indistinguishable, and they significantly 
outperform the test designed using the quadratic form 
bound, which appears to be far too conservative for 
the data sets considered. This is confirmed by checking
the Type I error of the quadratic form test, which
is significantly smaller than the test size of $\alpha = 0.05$.

We also compare the performance to that of the Gaus- 
sian kernel, with the bandwidth set to the median dis- 
tance between points in the aggregation of samples. 
We see that when the means differ, both tests perform
similarly. When the variances differ, it is clear that the
Gaussian kernel has a major advantage over the distance
kernel, although this advantage decreases with
increasing dimension (where both perform poorly). In
the case of a sinusoidal perturbation, the performance
is again very similar.

[Figure 1: panel titles "Means differ, $\alpha = 0.05$"]

Figure 1. (left) MMD using Gaussian and distance kernels
for various tests; (right) Spectral MMD using distance kernels
with various exponents. The number of samples in all
experiments was set to m = 200.

In addition, following Example 8, we investigate the
performance of kernels obtained using the semimetric
$\rho(z, z') = \|z - z'\|^q$ for $0 < q \le 2$. Results are
presented in the right hand plots of Figure 1. While
judiciously chosen values of q offer some improvement
in the cases of differing mean and variance, we see
a dramatic improvement for the sinusoidal perturbation,
compared with the case q = 1 and the Gaussian
kernel: values q = 1/3 (and smaller) yield virtually
error-free performance even at high frequencies
(note that q = 1 corresponds to the energy distance described
in Szekely & Rizzo (2004; 2005)). Additional
experiments with real-world data are presented in
Appendix A.6.

We observe from the simulation results that distance
kernels with higher exponents are advantageous in
cases where distributions differ in mean value along
a single dimension (with noise in the remainder),
whereas distance kernels with smaller exponents are
more sensitive to differences in distributions at finer
lengthscales (i.e., where the characteristic functions of
the distributions differ at higher frequencies). This
observation also appears to hold true on the real-world
data experiments in Appendix A.6.

[Figure 2: (left) panel title "m = 1024, d = 4, $\alpha = 0.05$", x-axis "angle of rotation ($\times \pi/4$)"; (right) panel title "m = 512, $\alpha = 0.05$", x-axis "frequency"]

Figure 2. HSIC using distance kernels with various exponents
and a Gaussian kernel as a function of (left) the angle
of rotation for the dependence induced by rotation; (right)
frequency $\ell$ in the sinusoidal dependence example.

6.2. Independence Experiments 

To assess independence tests, we used an artificial 
benchmark proposed by Gretton et al. (2008): we gen- 
erate univariate random variables from the ICA bench- 
mark densities of Bach & Jordan (2002); rotate them 
in the product space by an angle between 0 and $\pi/4$ to
introduce dependence; fill additional dimensions with 
independent Gaussian noise; and, finally, pass the re- 
sulting multivariate data through random and inde- 
pendent orthogonal transformations. The resulting 
random variables X and Y are dependent but uncorrelated.
The case m = 1024 (sample size) and d = 4
(dimension) is plotted in Figure 2 (left). As observed 
by Gretton et al. (2009b), the Gaussian kernel does 
better than the distance kernel with q = 1. By vary- 
ing q, however, we are able to obtain a wide range of 
performance; in particular, the values q = 1/6 (and 
smaller) have an advantage over the Gaussian kernel 
on this dataset, especially in the case of smaller an- 
gles of rotation. As for the two-sample case, bootstrap 
and spectral tests have indistinguishable performance, 
and are significantly more sensitive than the quadratic 
form based test, which failed to reject the null hypoth- 
esis of independence on this dataset. 

In addition, we assess the test performance on sinusoidally dependent data.
The distribution over the random variable pair X, Y was drawn from
$P_{XY} \propto 1 + \sin(\ell x)\sin(\ell y)$ for integer $\ell$, on the
support $\mathcal{X} \times \mathcal{Y}$, where $\mathcal{X} := [-\pi, \pi]$
and $\mathcal{Y} := [-\pi, \pi]$. In this way, increasing $\ell$ caused the
departure from a uniform (independent) distribution to occur at increasing
frequencies, making this departure harder to detect from a small sample size.
Results are in Figure 2 (right). We note that the distance covariance
outperforms the Gaussian kernel on this example, and that smaller exponents
result in better performance (lower Type II error when the departure from
independence occurs at higher frequencies). Finally, we note that the setting
q = 1, which is described in Szekely et al. (2007); Szekely & Rizzo (2009), is
a reasonable heuristic in practice, but does not yield the most powerful tests
on either dataset.
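
One simple way to draw from the sinusoidal density above is rejection sampling from the uniform distribution on the square. The sketch below is an assumption about how such data could be generated, not a description of the procedure actually used in the experiments; `sample_sinusoid` is a hypothetical name.

```python
import math
import random

def sample_sinusoid(ell, n, rng):
    """Draw n pairs (x, y) on [-pi, pi]^2 with density proportional to
    1 + sin(ell * x) * sin(ell * y), by rejection from the uniform."""
    pairs = []
    while len(pairs) < n:
        x = rng.uniform(-math.pi, math.pi)
        y = rng.uniform(-math.pi, math.pi)
        # The unnormalised density lies in [0, 2]: accept with probability density / 2.
        if rng.random() * 2.0 <= 1.0 + math.sin(ell * x) * math.sin(ell * y):
            pairs.append((x, y))
    return pairs

rng = random.Random(0)
sample = sample_sinusoid(ell=2, n=500, rng=rng)

# Under this density, E[sin(ell X) sin(ell Y)] = 1/4 for any integer ell >= 1,
# so the sample average below should be clearly positive.
mean_product = sum(math.sin(2 * x) * math.sin(2 * y) for x, y in sample) / len(sample)
print(mean_product)
```

About half of the proposals are accepted on average, regardless of the frequency.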

7. Conclusion 

We have established an equivalence between the energy 
distance and distance covariance, and RKHS measures 
of distance between distributions. In particular, en- 
ergy distances and RKHS distance measures coincide 
when the kernel is induced by a semimetric of nega- 
tive type. The associated family of kernels performs 
well in two-sample and independence testing: interest- 
ingly, the parameter choice most commonly used in the 
statistics literature does not yield the most powerful 
tests in many settings. 

The interpretation of the energy distance and dis- 
tance covariance in an RKHS setting should be of 
considerable interest both to statisticians and machine 
learning researchers, since the associated kernels may 
be used much more widely: in conditional depen- 
dence testing and estimates of the chi-squared dis- 
tance (Fukumizu et al., 2008), in Bayesian inference 
(Fukumizu et al., 2011), in mixture density estimation 
(Sriperumbudur, 2011) and in other machine learning 
applications. In particular, the link with kernels makes 
these applications of the energy distance immediate 
and straightforward. Finally, for problem settings de- 
fined most naturally in terms of distances, and where 
these distances are of negative type, there is an in- 
terpretation in terms of reproducing kernels, and the 
learning machinery from the kernel literature can be 
brought to bear. 

A. Appendix 
A.1. Proofs 

Proof. (Proposition 7) If $z, z' \in \mathcal{Z}$ are such that
$k(w, z) = k(w, z')$ for all $w \in \mathcal{Z}$, one would also have
$\rho(z, z_0) - \rho(z, w) = \rho(z', z_0) - \rho(z', w)$ for all
$w \in \mathcal{Z}$. In particular, by inserting $w = z$ and $w = z'$, we
obtain $\rho(z, z') = -\rho(z, z') = 0$, i.e., $z = z'$. The second statement
follows readily by expressing k in terms of $\rho$. □

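Proof. (Theorem 11) Follows directly by inserting the distance kernel from
Lemma 6 into (8), and cancelling out the terms dependent on a single random
variable. Define $\theta := \gamma_k^2(P, Q)$. Then

\[
\begin{aligned}
\theta &= \tfrac{1}{2} E_{ZZ'}\left[\rho(Z, z_0) + \rho(Z', z_0) - \rho(Z, Z')\right] \\
       &\quad + \tfrac{1}{2} E_{WW'}\left[\rho(W, z_0) + \rho(W', z_0) - \rho(W, W')\right] \\
       &\quad - E_{ZW}\left[\rho(Z, z_0) + \rho(W, z_0) - \rho(Z, W)\right] \\
       &= E_{ZW}\rho(Z, W) - \tfrac{1}{2} E_{ZZ'}\rho(Z, Z') - \tfrac{1}{2} E_{WW'}\rho(W, W').
\end{aligned}
\]

□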
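
The identity in this proof can be checked numerically on empirical distributions, where it holds exactly for V-statistics. The following sketch assumes the usual definition of the energy distance, $2E\rho(Z,W) - E\rho(Z,Z') - E\rho(W,W')$, takes $\rho(a,b) = |a-b|^q$ on the real line, and uses hypothetical helper names throughout.

```python
import random

def rho(a, b, q=1.0):
    """Semimetric |a - b|^q on the real line (of negative type for 0 < q <= 2)."""
    return abs(a - b) ** q

def dist_kernel(a, b, z0=0.0, q=1.0):
    """Distance kernel k(z, z') = (rho(z, z0) + rho(z', z0) - rho(z, z')) / 2."""
    return 0.5 * (rho(a, z0, q) + rho(b, z0, q) - rho(a, b, q))

def mean_pairwise(f, xs, ys):
    """Average of f over all pairs (V-statistic style, diagonal included)."""
    return sum(f(x, y) for x in xs for y in ys) / (len(xs) * len(ys))

def energy_distance(zs, ws, q=1.0):
    r = lambda a, b: rho(a, b, q)
    return 2 * mean_pairwise(r, zs, ws) - mean_pairwise(r, zs, zs) - mean_pairwise(r, ws, ws)

def mmd_squared(zs, ws, z0=0.0, q=1.0):
    k = lambda a, b: dist_kernel(a, b, z0, q)
    return mean_pairwise(k, zs, zs) + mean_pairwise(k, ws, ws) - 2 * mean_pairwise(k, zs, ws)

rng = random.Random(1)
zs = [rng.gauss(0.0, 1.0) for _ in range(40)]
ws = [rng.gauss(0.5, 1.0) for _ in range(40)]

energy = energy_distance(zs, ws, q=0.5)
mmd2 = mmd_squared(zs, ws, z0=3.0, q=0.5)   # the centre z0 is arbitrary
print(energy, 2 * mmd2)                     # the two values agree up to rounding
```

The terms involving the centre $z_0$ cancel, so any choice of `z0` gives the same squared MMD.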

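Proof. (Theorem 12) First, we note that k is a valid reproducing kernel since
$k((x, y), (x', y')) = k_X(x, x')\,k_Y(y, y')$, where we have taken
$k_X(x, x') = \rho_X(x, x_0) + \rho_X(x', x_0) - \rho_X(x, x')$, and
$k_Y(y, y') = \rho_Y(y, y_0) + \rho_Y(y', y_0) - \rho_Y(y, y')$, as distance
kernels induced by $\rho_X$ and $\rho_Y$, respectively. Indeed, a product of
two reproducing kernels is always a valid reproducing kernel on the product
space (Steinwart & Christmann, 2008, Lemma 4.6, p. 114). To show equality to
distance covariance, we start by expanding
$\theta := \gamma_k^2(P_{XY}, P_X P_Y)$,

\[
\theta = \underbrace{E_{XY} E_{X'Y'}\, k_X(X, X')\, k_Y(Y, Y')}_{\theta_1}
       + \underbrace{E_X E_{X'} k_X(X, X')\; E_Y E_{Y'} k_Y(Y, Y')}_{\theta_2}
       - 2\,\underbrace{E_{X'Y'}\left[E_X k_X(X, X')\; E_Y k_Y(Y, Y')\right]}_{\theta_3}.
\]

Note that

\[
\begin{aligned}
\theta_1 &= E_{XY} E_{X'Y'}\, \rho_X(X, X')\, \rho_Y(Y, Y') \\
         &\quad + 2 E_X \rho_X(X, x_0)\, E_Y \rho_Y(Y, y_0) \\
         &\quad + 2 E_{XY}\, \rho_X(X, x_0)\, \rho_Y(Y, y_0) \\
         &\quad - 2 E_{XY}\left[\rho_X(X, x_0)\, E_{Y'} \rho_Y(Y, Y')\right] \\
         &\quad - 2 E_{XY}\left[\rho_Y(Y, y_0)\, E_{X'} \rho_X(X, X')\right],
\end{aligned}
\]

\[
\begin{aligned}
\theta_2 &= E_X E_{X'} \rho_X(X, X')\; E_Y E_{Y'} \rho_Y(Y, Y') \\
         &\quad + 4 E_X \rho_X(X, x_0)\, E_Y \rho_Y(Y, y_0) \\
         &\quad - 2 E_X \rho_X(X, x_0)\; E_Y E_{Y'} \rho_Y(Y, Y') \\
         &\quad - 2 E_Y \rho_Y(Y, y_0)\; E_X E_{X'} \rho_X(X, X'),
\end{aligned}
\]

and

\[
\begin{aligned}
\theta_3 &= E_{X'Y'}\left[E_X \rho_X(X, X')\; E_Y \rho_Y(Y, Y')\right] \\
         &\quad + 3 E_X \rho_X(X, x_0)\, E_Y \rho_Y(Y, y_0) \\
         &\quad + E_{XY}\, \rho_X(X, x_0)\, \rho_Y(Y, y_0) \\
         &\quad - E_{XY}\left[\rho_X(X, x_0)\, E_{Y'} \rho_Y(Y, Y')\right] \\
         &\quad - E_{XY}\left[\rho_Y(Y, y_0)\, E_{X'} \rho_X(X, X')\right] \\
         &\quad - E_X \rho_X(X, x_0)\; E_Y E_{Y'} \rho_Y(Y, Y') \\
         &\quad - E_Y \rho_Y(Y, y_0)\; E_X E_{X'} \rho_X(X, X').
\end{aligned}
\]

The claim now follows by inserting the resulting expansions and cancelling the
appropriate terms. Note that only the leading terms in the expansions remain. □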
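
The cancellation in this proof also holds exactly for empirical measures, which gives a quick numerical check. The sketch below compares the V-statistic estimate of $\gamma_k^2(P_{XY}, P_X P_Y)$, computed from doubly centred Gram matrices of $k_X$ and $k_Y$, with the sample distance covariance computed from doubly centred distance matrices. It assumes scalar data and $\rho(a,b) = |a-b|^q$; all function names are hypothetical.

```python
import random

def double_centre(m):
    """Return H M H for a square matrix M (list of lists), H = I - (1/n) 1 1^T."""
    n = len(m)
    row = [sum(r) / n for r in m]
    col = [sum(m[i][j] for i in range(n)) / n for j in range(n)]
    tot = sum(row) / n
    return [[m[i][j] - row[i] - col[j] + tot for j in range(n)] for i in range(n)]

def inner(a, b):
    """Normalised Frobenius inner product (1 / n^2) * sum_ij a_ij * b_ij."""
    n = len(a)
    return sum(a[i][j] * b[i][j] for i in range(n) for j in range(n)) / n ** 2

def dist_matrix(xs, q):
    return [[abs(a - b) ** q for b in xs] for a in xs]

def kernel_matrix(xs, x0, q):
    # k_X(x, x') = rho(x, x0) + rho(x', x0) - rho(x, x'), as in the proof above.
    return [[abs(a - x0) ** q + abs(b - x0) ** q - abs(a - b) ** q for b in xs]
            for a in xs]

rng = random.Random(2)
xs = [rng.gauss(0.0, 1.0) for _ in range(30)]
ys = [x * x + 0.5 * rng.gauss(0.0, 1.0) for x in xs]  # dependent on xs

q = 1.0
dcov2 = inner(double_centre(dist_matrix(xs, q)), double_centre(dist_matrix(ys, q)))
hsic = inner(double_centre(kernel_matrix(xs, 0.7, q)),
             double_centre(kernel_matrix(ys, -1.3, q)))
print(dcov2, hsic)  # equal up to rounding, for any centres x0 and y0
```

Double centring removes every term that depends on a single argument, which is the matrix counterpart of the cancellation of the $x_0$ and $y_0$ terms.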

Remark 15. It turns out that k is not characteristic to
$\mathcal{M}_+^1(\mathcal{X} \times \mathcal{Y})$, i.e., it cannot distinguish
between every pair of distinct distributions on
$\mathcal{X} \times \mathcal{Y}$, even if $k_X$ and $k_Y$ are characteristic.
However, since $\gamma_k$ is equal to the Brownian distance covariance, we
know that it can always distinguish between any $P_{XY}$ and its product of
marginals $P_X P_Y$ in the Euclidean case. Namely, note that
$k((x_0, y), (x_0, y')) = k((x, y_0), (x', y_0)) = 0$ for all
$x, x' \in \mathcal{X}$, $y, y' \in \mathcal{Y}$. That means that for every
two distinct $P_Y, Q_Y \in \mathcal{M}_+^1(\mathcal{Y})$, one has
$\gamma_k^2(\delta_{x_0} P_Y, \delta_{x_0} Q_Y) = 0$. Thus, the kernel in (9)
characterizes independence but not equality of probability measures on the
product space. Informally speaking, independence testing is an easier problem
than homogeneity testing on the product space.

A.2. Spectral Tests 

Assume that the null hypothesis holds, i.e., that P = Q. For a kernel k and a
Borel probability measure P, define a kernel "centred" at P:
$\tilde{k}_P(z, z') := k(z, z') + E_{WW'} k(W, W') - E_W k(z, W) - E_W k(z', W)$,
with $W, W' \overset{i.i.d.}{\sim} P$. Note that as a special case for
$P = \delta_{z_0}$ we recover the family of kernels in (4), and that
$E_{ZZ'} \tilde{k}_P(Z, Z') = 0$, i.e., $\mu_{\tilde{k}_P}(P) = 0$. The
centred kernel is important in characterizing the null distribution of the
V-statistic. To the centred kernel $\tilde{k}_P$ on domain $\mathcal{Z}$, one
associates the integral kernel operator
$S_{\tilde{k}_P} : L^2_P(\mathcal{Z}) \to L^2_P(\mathcal{Z})$ (see Steinwart &
Christmann, 2008, p. 126-127), given by:

\[
S_{\tilde{k}_P}\, g(z) = \int_{\mathcal{Z}} \tilde{k}_P(z, w)\, g(w)\, dP(w). \tag{13}
\]

The following theorem is a special case of Gretton et al. (2012, Theorem 12).
For simplicity, we focus on the case where m = n.

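Theorem 16. Let $\mathbf{Z} = \{Z_i\}_{i=1}^m$ and
$\mathbf{W} = \{W_i\}_{i=1}^m$ be two i.i.d. samples from
$P \in \mathcal{M}_+^1(\mathcal{Z})$, and let $S_{\tilde{k}_P}$ be a trace
class operator. Then

\[
\frac{m}{2}\, \hat{\gamma}^2_{k,V}(\mathbf{Z}, \mathbf{W}) \rightsquigarrow \sum_{i=1}^{\infty} \lambda_i N_i^2, \tag{14}
\]

where $N_i \overset{i.i.d.}{\sim} \mathcal{N}(0, 1)$, $i \in \mathbb{N}$, and
$\{\lambda_i\}_{i=1}^{\infty}$ are the eigenvalues of the operator
$S_{\tilde{k}_P}$.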

Note that this result requires that the integral kernel operator associated to
the underlying probability measure P is a trace class operator, i.e., that
$E_{Z \sim P}\, k(Z, Z) < \infty$. As before, the sufficient condition for
this to hold for all probability measures is that k is a bounded function. In
the case of a distance kernel, this is the case if the domain $\mathcal{Z}$
has a bounded diameter with respect to the semimetric $\rho$, i.e., that
$\sup_{z, z' \in \mathcal{Z}} \rho(z, z') < \infty$.

The null distribution of HSIC takes an analogous form to (14) of a weighted
sum of chi-squares, but with coefficients corresponding to the products of the
eigenvalues of integral operators $S_{\tilde{k}_{P_X}}$ and
$S_{\tilde{k}_{P_Y}}$. The following theorem is in Zhang et al. (2011,
Theorem 4) and gives an asymptotic form for the null distribution of HSIC. See
also Lyons (2011, Remark 2.9).

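Theorem 17. Let $\mathbf{Z} = \{(X_i, Y_i)\}_{i=1}^m$ be an i.i.d. sample from
$P_{XY} = P_X P_Y$, with values in $\mathcal{X} \times \mathcal{Y}$. Let
$S_{\tilde{k}_{P_X}} : L^2_{P_X}(\mathcal{X}) \to L^2_{P_X}(\mathcal{X})$, and
$S_{\tilde{k}_{P_Y}} : L^2_{P_Y}(\mathcal{Y}) \to L^2_{P_Y}(\mathcal{Y})$ be
trace class operators. Then

\[
m\, \mathrm{HSIC}(\mathbf{Z}; k_X, k_Y) \rightsquigarrow \sum_{i=1}^{\infty} \sum_{j=1}^{\infty} \lambda_i \eta_j N_{i,j}^2, \tag{15}
\]

where $N_{i,j} \sim \mathcal{N}(0, 1)$, $i, j \in \mathbb{N}$, are independent
and $\{\lambda_i\}_{i=1}^{\infty}$ and $\{\eta_j\}_{j=1}^{\infty}$ are the
eigenvalues of the operators $S_{\tilde{k}_{P_X}}$ and $S_{\tilde{k}_{P_Y}}$,
respectively.

Note that if $\mathcal{X}$ and $\mathcal{Y}$ have bounded diameters w.r.t.
$\rho_X$ and $\rho_Y$, Theorem 17 applies to distance kernels induced by
$\rho_X$ and $\rho_Y$ for all $P_X \in \mathcal{M}_+^1(\mathcal{X})$,
$P_Y \in \mathcal{M}_+^1(\mathcal{Y})$.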
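
The centred kernel used in this section is easy to compute when P is an empirical measure. The sketch below is illustrative only: it uses a Gaussian kernel on the real line, and `centre_kernel` and the other names are hypothetical. It checks that the centred kernel averages to zero over the sample, computes the empirical analogue of the trace $E\,\tilde{k}_P(Z, Z)$, and shows the point-mass special case $P = \delta_{z_0}$.

```python
import math
import random

def gauss_kernel(a, b, sigma=1.0):
    return math.exp(-(a - b) ** 2 / (2.0 * sigma ** 2))

def centre_kernel(k, support):
    """Kernel centred at the empirical measure P on `support`:
    k_P(z, z') = k(z, z') + E k(W, W') - E k(z, W) - E k(z', W)."""
    n = len(support)
    e_ww = sum(k(w, v) for w in support for v in support) / n ** 2

    def e_w(z):
        return sum(k(z, w) for w in support) / n

    def k_centred(z, zp):
        return k(z, zp) + e_ww - e_w(z) - e_w(zp)

    return k_centred

rng = random.Random(3)
sample = [rng.uniform(-2.0, 2.0) for _ in range(25)]
k_tilde = centre_kernel(gauss_kernel, sample)

# E_{ZZ'} k_P(Z, Z') = 0 when Z, Z' are drawn from the same (empirical) P.
mean_centred = sum(k_tilde(z, zp) for z in sample for zp in sample) / len(sample) ** 2

# Empirical analogue of the trace E k_P(Z, Z), i.e. the sum of the eigenvalues.
trace_estimate = sum(k_tilde(z, z) for z in sample) / len(sample)

# Centring at a point mass delta_{z0}:
# k(z, z') + k(z0, z0) - k(z, z0) - k(z', z0).
z0 = 0.3
k_point = centre_kernel(gauss_kernel, [z0])
print(mean_centred, trace_estimate, k_point(1.0, -0.5))
```

A spectral test would go one step further and use the eigenvalues of the centred Gram matrix to simulate the limiting weighted sum of chi-squares; that step is omitted here.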



A.3. A Characteristic Function Based Interpretation 

The distance covariance in (7) was defined by Szekely et al. (2007) in terms
of a weighted distance between characteristic functions. We briefly review
this interpretation here; however, we show that this approach cannot be used
to derive a kernel-based measure of dependence (this result was first noted by
Gretton et al. (2009b), and is included here in the interests of
completeness). Let X be a random vector on $\mathcal{X} = \mathbb{R}^p$ and Y
a random vector on $\mathcal{Y} = \mathbb{R}^q$. The characteristic functions
of X and Y, respectively, will be denoted by $f_X$ and $f_Y$, and their joint
characteristic function by $f_{XY}$. The distance covariance $V(X, Y)$ is
defined via the norm of $f_{XY} - f_X f_Y$ in a weighted $L_2$ space on
$\mathbb{R}^{p+q}$, i.e.,

\[
V^2(X, Y) = \int \left| f_{X,Y}(t, s) - f_X(t) f_Y(s) \right|^2 w(t, s)\, dt\, ds, \tag{16}
\]

for a particular choice of weight function given by

\[
w(t, s) = \frac{1}{c_p c_q} \cdot \frac{1}{\|t\|^{1+p}\, \|s\|^{1+q}}, \tag{17}
\]

where $c_d = \pi^{(1+d)/2} / \Gamma\left(\tfrac{1+d}{2}\right)$, $d \ge 1$. An
important aspect of distance covariance is that $V(X, Y) = 0$ if and only if X
and Y are independent. We next obtain a similar statistic in the kernel
setting. Write $\mathcal{Z} = \mathcal{X} \times \mathcal{Y}$, and let
$k(z, z') = \kappa(z - z')$ be a translation invariant RKHS kernel on
$\mathcal{Z}$, where $\kappa : \mathcal{Z} \to \mathbb{R}$ is a bounded
continuous function. Using Bochner's theorem, $\kappa$ can be written as:

\[
\kappa(z) = \int e^{-i z^\top u}\, d\Lambda(u),
\]

for a finite non-negative Borel measure $\Lambda$. It follows from
Gretton et al. (2009b) that

\[
\gamma_k^2(P_{XY}, P_X P_Y) = \int \left| f_{X,Y}(t, s) - f_X(t) f_Y(s) \right|^2 d\Lambda(t, s),
\]

which is in clear correspondence with (16). However, the weight function in
(17) is not integrable, so one cannot find a translation invariant kernel for
which $\gamma_k$ coincides with the distance covariance. By contrast, note the
kernel in (9) is not translation invariant.
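
The non-integrability claimed above can be seen from a short computation in polar coordinates; the following is a sketch, with $|S^{p-1}|$ denoting the surface area of the unit sphere in $\mathbb{R}^p$ (notation introduced here, not in the original). The singularity of the weight at the origin already makes the integral over the unit ball diverge:

```latex
\int_{\|t\| \le 1} \frac{dt}{\|t\|^{1+p}}
  = |S^{p-1}| \int_0^1 r^{-(1+p)}\, r^{p-1}\, dr
  = |S^{p-1}| \int_0^1 r^{-2}\, dr
  = \infty .
```

Since a measure with density $w$ therefore has infinite total mass, it cannot play the role of the finite measure $\Lambda$ required by Bochner's theorem.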

A.4. Restriction on Probability Measures 

In general, distance kernels and their products are continuous but unbounded,
so kernel embeddings are not defined for all Borel probability measures. Thus,
one needs to restrict attention to a particular class of Borel probability
measures for which kernel embeddings exist, and a sufficient condition for
this is that $E_{Z \sim P}\, k^{1/2}(Z, Z) < \infty$, by the Riesz
representation theorem. Let k be a measurable reproducing kernel on
$\mathcal{Z}$, and denote, for $\theta > 0$,

\[
\mathcal{M}_k^{\theta}(\mathcal{Z}) = \left\{ \nu \in \mathcal{M}(\mathcal{Z}) : \int k^{\theta}(z, z)\, d|\nu|(z) < \infty \right\}. \tag{18}
\]

Note that the maximum mean discrepancy $\gamma_k(P, Q)$ is well defined
$\forall P, Q \in \mathcal{M}_k^{1/2}(\mathcal{Z}) \cap \mathcal{M}_+^1(\mathcal{Z})$.

Now, let $\rho$ be a semimetric of negative type. Then, we can consider the
class of probability measures that have a finite $\theta$-moment with respect
to $\rho$:

\[
\mathcal{M}_{\rho}^{\theta}(\mathcal{Z}) = \left\{ \nu \in \mathcal{M}(\mathcal{Z}) : \exists z_0 \in \mathcal{Z} \ \text{s.t.} \int \rho^{\theta}(z, z_0)\, d|\nu|(z) < \infty \right\}. \tag{19}
\]

To ensure existence of the energy distance $D_{E,\rho}(P, Q)$, we need to
assume that $P, Q \in \mathcal{M}_{\rho}^{1}(\mathcal{Z})$, as otherwise the
expectations $E_{ZZ'}\rho(Z, Z')$, $E_{WW'}\rho(W, W')$ and
$E_{ZW}\rho(Z, W)$ may be undefined. The following proposition shows that the
classes of probability measures in (18) and (19) coincide at
$\theta = n/2$, for $n \in \mathbb{N}$, whenever $\rho$ is generated by
kernel k.

Proposition 18. Let k be a kernel that generates semimetric $\rho$, and let
$n \in \mathbb{N}$. Then,
$\mathcal{M}_k^{n/2}(\mathcal{Z}) = \mathcal{M}_{\rho}^{n/2}(\mathcal{Z})$. In
particular, if $k_1$ and $k_2$ generate the same semimetric $\rho$, then
$\mathcal{M}_{k_1}^{n/2}(\mathcal{Z}) = \mathcal{M}_{k_2}^{n/2}(\mathcal{Z})$.

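Proof. Let $\theta \ge \tfrac{1}{2}$. Note that $a^{2\theta}$ is a convex
function of a. Suppose $\nu \in \mathcal{M}_k^{\theta}(\mathcal{Z})$. Then, we
have

\[
\begin{aligned}
\int \rho^{\theta}(z, z_0)\, d|\nu|(z)
  &= \int \left\| k(\cdot, z) - k(\cdot, z_0) \right\|_{\mathcal{H}_k}^{2\theta} d|\nu|(z) \\
  &\le \int \left( \|k(\cdot, z)\|_{\mathcal{H}_k} + \|k(\cdot, z_0)\|_{\mathcal{H}_k} \right)^{2\theta} d|\nu|(z) \\
  &\le 2^{2\theta - 1} \left( \int \|k(\cdot, z)\|_{\mathcal{H}_k}^{2\theta}\, d|\nu|(z) + \int \|k(\cdot, z_0)\|_{\mathcal{H}_k}^{2\theta}\, d|\nu|(z) \right) \\
  &= 2^{2\theta - 1} \left( \int k^{\theta}(z, z)\, d|\nu|(z) + k^{\theta}(z_0, z_0)\, |\nu|(\mathcal{Z}) \right) \\
  &< \infty,
\end{aligned}
\]

where we have invoked Jensen's inequality for convex functions. From the above
it is clear that
$\mathcal{M}_k^{\theta}(\mathcal{Z}) \subset \mathcal{M}_{\rho}^{\theta}(\mathcal{Z})$,
for $\theta \ge 1/2$.

To prove the other direction, we show by induction that
$\mathcal{M}_{\rho}^{\theta}(\mathcal{Z}) \subset \mathcal{M}_k^{n/2}(\mathcal{Z})$
for $\theta \ge \tfrac{n}{2}$, $n \in \mathbb{N}$. Let n = 1. Let
$\theta \ge \tfrac{1}{2}$, and suppose that
$\nu \in \mathcal{M}_{\rho}^{\theta}(\mathcal{Z})$. Then, by invoking the
reverse triangle and Jensen's inequalities, we have:

\[
\begin{aligned}
\int \rho^{\theta}(z, z_0)\, d|\nu|(z)
  &= \int \left\| k(\cdot, z) - k(\cdot, z_0) \right\|_{\mathcal{H}_k}^{2\theta} d|\nu|(z) \\
  &\ge \int \left| k^{1/2}(z, z) - k^{1/2}(z_0, z_0) \right|^{2\theta} d|\nu|(z) \\
  &\ge \left| \int k^{1/2}(z, z)\, d|\nu|(z) - \|\nu\|_{TV}\, k^{1/2}(z_0, z_0) \right|^{2\theta},
\end{aligned}
\]

which implies $\nu \in \mathcal{M}_k^{1/2}(\mathcal{Z})$, thereby satisfying
the result for n = 1. Suppose the result holds for
$\theta \ge \tfrac{n-1}{2}$, i.e.,
$\mathcal{M}_{\rho}^{\theta}(\mathcal{Z}) \subset \mathcal{M}_k^{(n-1)/2}(\mathcal{Z})$
for $\theta \ge \tfrac{n-1}{2}$. Let
$\nu \in \mathcal{M}_{\rho}^{\theta}(\mathcal{Z})$ for
$\theta \ge \tfrac{n}{2}$. Then we have

\[
\begin{aligned}
\int \rho^{\theta}(z, z_0)\, d|\nu|(z)
  &= \int \left( \left\| k(\cdot, z) - k(\cdot, z_0) \right\|_{\mathcal{H}_k}^{n} \right)^{2\theta/n} d|\nu|(z) \\
  &\ge \left| \int \left\| k(\cdot, z) - k(\cdot, z_0) \right\|_{\mathcal{H}_k}^{n} d|\nu|(z) \right|^{2\theta/n} \\
  &\ge \left| \int \Big| \|k(\cdot, z)\|_{\mathcal{H}_k} - \|k(\cdot, z_0)\|_{\mathcal{H}_k} \Big|^{n} d|\nu|(z) \right|^{2\theta/n} \\
  &\ge \left| \int \left( \|k(\cdot, z)\|_{\mathcal{H}_k} - \|k(\cdot, z_0)\|_{\mathcal{H}_k} \right)^{n} d|\nu|(z) \right|^{2\theta/n} \\
  &= \left| \int \sum_{r=0}^{n} (-1)^r \binom{n}{r} \|k(\cdot, z)\|_{\mathcal{H}_k}^{n-r}\, \|k(\cdot, z_0)\|_{\mathcal{H}_k}^{r}\, d|\nu|(z) \right|^{2\theta/n} \\
  &= \Bigg| \underbrace{\int k^{n/2}(z, z)\, d|\nu|(z)}_{A}
     + \underbrace{\sum_{r=1}^{n} (-1)^r \binom{n}{r}\, k^{r/2}(z_0, z_0) \int k^{(n-r)/2}(z, z)\, d|\nu|(z)}_{B} \Bigg|^{2\theta/n}.
\end{aligned}
\]

Note that the terms in B are finite as for
$\theta \ge \tfrac{n}{2} \ge \tfrac{n-1}{2} \ge \cdots \ge \tfrac{1}{2}$, we
have
$\mathcal{M}_{\rho}^{\theta}(\mathcal{Z}) \subset \mathcal{M}_k^{(n-1)/2}(\mathcal{Z}) \subset \cdots \subset \mathcal{M}_k^{1}(\mathcal{Z}) \subset \mathcal{M}_k^{1/2}(\mathcal{Z})$
and therefore A is finite, which means
$\nu \in \mathcal{M}_k^{n/2}(\mathcal{Z})$, i.e.,
$\mathcal{M}_{\rho}^{\theta}(\mathcal{Z}) \subset \mathcal{M}_k^{n/2}(\mathcal{Z})$
for $\theta \ge \tfrac{n}{2}$. The result shows that
$\mathcal{M}_{\rho}^{\theta}(\mathcal{Z}) = \mathcal{M}_k^{\theta}(\mathcal{Z})$
for all $\theta \in \{\tfrac{n}{2} : n \in \mathbb{N}\}$. □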



The above Proposition gives a natural interpretation of conditions on
probability measures in terms of moments w.r.t. $\rho$. Namely, the kernel
embedding $\mu_k(P)$, where kernel k generates the semimetric $\rho$, exists
for every P with finite half-moment w.r.t. $\rho$, and thus, the MMD between P
and Q, $\gamma_k(P, Q)$, is well defined whenever both P and Q have finite
half-moments w.r.t. $\rho$. If, in addition, P and Q have finite first moments
w.r.t. $\rho$, then the $\rho$-energy distance between P and Q is also well
defined and it must be equal to the MMD, by Theorem 11.

Rather than imposing the condition on Borel probability measures, one may
assume that the underlying semimetric space $(\mathcal{Z}, \rho)$ of negative
type is itself bounded, i.e., that
$\sup_{z, z' \in \mathcal{Z}} \rho(z, z') < \infty$, implying that distance
kernels are bounded functions, and that both MMD and energy distance are
always defined. Conversely, bounded kernels (such as Gaussian) always induce
bounded semimetrics.
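
As a small illustration of the last remark, the semimetric generated by a kernel, $\rho(z, z') = k(z, z) + k(z', z') - 2k(z, z')$ (the squared RKHS distance used in the proof above), can be evaluated for a Gaussian kernel. The sketch below is illustrative, with hypothetical names and an arbitrary grid; for the Gaussian kernel the value never exceeds 2.

```python
import math

def gauss_kernel(a, b, sigma=1.0):
    return math.exp(-(a - b) ** 2 / (2.0 * sigma ** 2))

def induced_semimetric(k, a, b):
    """rho(z, z') = k(z, z) + k(z', z') - 2 k(z, z'): squared RKHS distance."""
    return k(a, a) + k(b, b) - 2.0 * k(a, b)

# Evaluate rho on a grid of points in [-10, 10].
grid = [x / 2.0 for x in range(-20, 21)]
rho_max = max(induced_semimetric(gauss_kernel, a, b) for a in grid for b in grid)
print(rho_max)  # approaches, but does not exceed, 2
```

The bound follows from $k(z, z) = 1$ and $k(z, z') \ge 0$ for the Gaussian kernel, so $\rho \le 2$ however far apart the points are.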



Hypothesis Testing Using Pairwise Distances and Associated Kernels 



Table 1. MMD with distance kernels on data from Gretton et al. (2009a). Dimensionality is: Neural I (64), Neural II
(100), Health status (12,600), Subtype (2,118). An asterisk marks instances where the distance kernel had smaller Type II
error in comparison to the Gaussian kernel.

Dataset              Measure      Gauss      dist(1/3)  dist(2/3)  dist(1)    dist(4/3)  dist(5/3)  dist(2)
Neural I (m = 200)   1 - Type I   .956       .969       .964       .949       .952       .959       .959
                     Type II      .118       .170       .139       .119       .109*      .089*      .117*
Neural I (m = 250)   1 - Type I   .950       .969       .946       .962       .947       .930       .953
                     Type II      .063       .075       .045*      .041*      .040*      .065       .052*
Neural II (m = 200)  1 - Type I   .956       .968       .965       .963       .956       .958       .943
                     Type II      .292       .485       .346       .319       .297       .280*      .290*
Neural II (m = 250)  1 - Type I   .963       .980       .968       .950       .952       .960       .941
                     Type II      .195       .323       .197       .189*      .194*      .169*      .183*
Subtype (m = 10)     1 - Type I   .975       .974       .977       .971       .966       .962       .966
                     Type II      .055       .828       .237       .092       .042*      .033*      .024*
Health st. (m = 20)  1 - Type I   .958       .980       .953       .940       .954       .954       .955
                     Type II      .036       .037       .039       .081       .114       .120       .165


A.5. Distance Correlation 

The notion of distance covariance extends naturally to that of distance
variance $V^2(X) = V^2(X, X)$ and that of distance correlation (in analogy to
the Pearson product-moment correlation coefficient):

\[
R^2(X, Y) =
\begin{cases}
\dfrac{V^2(X, Y)}{V(X)\, V(Y)}, & V(X)\, V(Y) > 0, \\[1ex]
0, & V(X)\, V(Y) = 0.
\end{cases}
\]

Distance correlation also has a straightforward interpretation in terms of
kernels as:

\[
R^2(X, Y) = \frac{V^2(X, Y)}{V(X)\, V(Y)}
          = \frac{\gamma_k^2(P_{XY}, P_X P_Y)}{\gamma_k(P_{XX}, P_X P_X)\, \gamma_k(P_{YY}, P_Y P_Y)}
          = \frac{\|\Sigma_{XY}\|_{HS}^2}{\|\Sigma_{XX}\|_{HS}\, \|\Sigma_{YY}\|_{HS}},
\]

where the covariance operator
$\Sigma_{XY} : \mathcal{H}_{k_X} \to \mathcal{H}_{k_Y}$ is a linear operator
for which
$\langle \Sigma_{XY} f, g \rangle_{\mathcal{H}_{k_Y}} = E_{XY}[f(X) g(Y)] - E_X f(X)\, E_Y g(Y)$,
for all $f \in \mathcal{H}_{k_X}$ and $g \in \mathcal{H}_{k_Y}$, and
$\|\cdot\|_{HS}$ denotes the Hilbert-Schmidt norm (Gretton et al., 2005b). It
is clear that $R$ is invariant to scaling
$(X, Y) \mapsto (\epsilon X, \epsilon Y)$, $\epsilon > 0$, whenever the
corresponding semimetrics are homogeneous, i.e., whenever
$\rho_X(\epsilon x, \epsilon x') = \epsilon \rho_X(x, x')$, and similarly for
$\rho_Y$. Moreover, $R$ is invariant to translations
$(X, Y) \mapsto (X + x', Y + y')$, $x' \in \mathcal{X}$,
$y' \in \mathcal{Y}$, whenever $\rho_X$ and $\rho_Y$ are translation
invariant.

A.6. Further Experiments 

We assessed performance of two-sample tests based on distance kernels with
various exponents and compared it to that of a Gaussian kernel on real-world
multivariate datasets: Health st. (microarray data from normal and tumor
tissues), Subtype (microarray data from different subtypes of cancer) and
Neural I/II (local field potential (LFP) electrode recordings from the Macaque
primary visual cortex (V1) with and without spike events), all discussed in
Gretton et al. (2009a). In contrast to Gretton et al. (2009a), we used smaller
sample sizes, so that some Type II error persists. At higher sample sizes, all
tests exhibit Type II error which is virtually zero. The results are reported
in Table 1. We used the spectral test for all experiments, and the reported
averages are obtained by running 1000 trials. We note that for dataset
Subtype, which is high dimensional but with only a small number of dimensions
varying in mean, a larger exponent results in a test of greater power.
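
The distance correlation of Appendix A.5 and its invariance properties can be illustrated with a sample (V-statistic) version. This is a minimal sketch for scalar data with $\rho(a, b) = |a - b|$, which is both homogeneous and translation invariant; the function names are hypothetical.

```python
import math
import random

def centred_distances(xs):
    """Doubly centred matrix of pairwise distances |x_i - x_j|."""
    n = len(xs)
    d = [[abs(a - b) for b in xs] for a in xs]
    row = [sum(r) / n for r in d]
    tot = sum(row) / n
    # d is symmetric, so row means and column means coincide.
    return [[d[i][j] - row[i] - row[j] + tot for j in range(n)] for i in range(n)]

def dcov2(a, b):
    """Sample squared distance covariance from two centred distance matrices."""
    n = len(a)
    return sum(a[i][j] * b[i][j] for i in range(n) for j in range(n)) / n ** 2

def distance_correlation(xs, ys):
    """Sample version of R(X, Y), where R^2 = V^2(X, Y) / (V(X) V(Y))."""
    a, b = centred_distances(xs), centred_distances(ys)
    vxy, vx, vy = dcov2(a, b), dcov2(a, a), dcov2(b, b)
    denom = math.sqrt(vx * vy)  # V(X) V(Y) = sqrt(V^2(X) V^2(Y))
    return math.sqrt(max(vxy, 0.0) / denom) if denom > 0 else 0.0

rng = random.Random(4)
xs = [rng.gauss(0.0, 1.0) for _ in range(40)]
ys = [math.sin(2.0 * x) + 0.3 * rng.gauss(0.0, 1.0) for x in xs]

r = distance_correlation(xs, ys)
r_scaled = distance_correlation([3.0 * x for x in xs], [3.0 * y for y in ys])
r_shifted = distance_correlation([x + 10.0 for x in xs], [y - 5.0 for y in ys])
print(r, r_scaled, r_shifted)  # the three values agree up to rounding
```

Scaling both variables by the same factor multiplies numerator and denominator of $R^2$ by the same amount, and translations leave all pairwise distances unchanged.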



